Source of the materials: Biopython cookbook (adapted) Status: Draft
Bio.PDB is a Biopython module that focuses on working with crystal structures of biological macromolecules. Among other things, Bio.PDB includes a PDBParser class that produces a Structure object, which can be used to access the atomic data in the file in a convenient manner. There is limited support for parsing the information contained in the PDB header.
First we create a PDBParser
object:
In [1]:
from Bio.PDB.PDBParser import PDBParser
p = PDBParser(PERMISSIVE=1)
The PERMISSIVE flag indicates that a number of common problems (see [problem structures]) associated with PDB files will be ignored (but note that some atoms and/or residues will be missing). If the flag is not present a PDBConstructionException will be generated if any problems are detected during the parse operation.
The Structure object is then produced by letting the PDBParser
object
parse a PDB file (the PDB file in this case is called ’pdb1fat.ent’,
’1fat’ is a user defined name for the structure):
In [2]:
structure_id = "1fat"
filename = "data/pdb1fat.ent"
structure = p.get_structure(structure_id, filename)
You can extract the header and trailer (simple lists of strings) of the
PDB file from the PDBParser object with the get\_header and
get\_trailer methods. Note however that many PDB files
contain headers with incomplete or erroneous information. Many of the
errors have been fixed in the equivalent mmCIF files. Hence, if you are
interested in the header information, it is a good idea to extract
information from mmCIF files using the MMCIF2Dict
tool described
below, instead of parsing the PDB header.
Now that is clarified, let’s return to parsing the PDB header. The
structure object has an attribute called header
which is a Python
dictionary that maps header records to their values.
Example:
In [3]:
resolution = structure.header['resolution']
keywords = structure.header['keywords']
The available keys are name
, head
, deposition_date
,
release_date
, structure_method
, resolution
, structure_reference
(which maps to a list of references), journal_reference
, author
, and
compound
(which maps to a dictionary with various information about
the crystallized compound).
The dictionary can also be created without creating a Structure
object, ie. directly from the PDB file:
In [4]:
file = open(filename, 'r')
header_dict = parse_pdb_header(file)
file.close()
In [5]:
from Bio.PDB.MMCIFParser import MMCIFParser
parser = MMCIFParser()
Then use this parser to create a structure object from the mmCIF file:
In [6]:
structure = parser.get_structure('1fat', 'data/1fat.cif')
To have some more low level access to an mmCIF file, you can use the
MMCIF2Dict
class to create a Python dictionary that maps all mmCIF
tags in an mmCIF file to their values. If there are multiple values
(like in the case of tag _atom_site.Cartn_y
, which holds the $y$
coordinates of all atoms), the tag is mapped to a list of values. The
dictionary is created from the mmCIF file as follows:
In [7]:
from Bio.PDB.MMCIF2Dict import MMCIF2Dict
mmcif_dict = MMCIF2Dict('data/1fat.cif')
Example: get the solvent content from an mmCIF file:
In [8]:
sc = mmcif_dict['_exptl_crystal.density_percent_sol']
Example: get the list of the $y$ coordinates of all atoms
In [9]:
y_list = mmcif_dict['_atom_site.Cartn_y']
That’s not yet supported, but we are definitely planning to support that in the future (it’s not a lot of work). Contact the Biopython developers () if you need this).
Use the PDBIO class for this. It’s easy to write out specific parts of a structure too, of course.
Example: saving a structure
In [11]:
from Bio.PDB import PDBIO
io = PDBIO()
io.set_structure(s)
io.save('out.pdb')
If you want to write out a part of the structure, make use of the
Select
class (also in PDBIO
). Select has four methods:
accept_model(model)
accept_chain(chain)
accept_residue(residue)
accept_atom(atom)
By default, every method returns 1 (which means the
model/chain/residue/atom is included in the output). By subclassing
Select
and returning 0 when appropriate you can exclude models,
chains, etc. from the output. Cumbersome maybe, but very powerful. The
following code only writes out glycine residues:
In [14]:
from Bio.PDB.PDBIO import Select
class GlySelect(Select):
def accept_residue(self, residue):
if residue.get_name() == 'GLY':
return True
else:
return False
In [15]:
io = PDBIO()
io.set_structure(s)
io.save('gly_only.pdb', GlySelect())
If this is all too complicated for you, the Dice
module contains a
handy extract
function that writes out all residues in a chain between
a start and end residue.
The overall layout of a Structure
object follows the so-called SMCRA
(Structure/Model/Chain/Residue/Atom) architecture:
A structure consists of models
A model consists of chains
A chain consists of residues
A residue consists of atoms
This is the way many structural biologists/bioinformaticians think about
structure, and provides a simple but efficient way to deal with
structure. Additional stuff is essentially added when needed. A UML
diagram of the Structure
object (forget about the Disordered
classes
for now) is shown in Fig. [fig:smcra]. Such a data structure is not
necessarily best suited for the representation of the macromolecular
content of a structure, but it is absolutely necessary for a good
interpretation of the data present in a file that describes the
structure (typically a PDB or MMCIF file). If this hierarchy cannot
represent the contents of a structure file, it is fairly certain that
the file contains an error or at least does not describe the structure
unambiguously. If a SMCRA data structure cannot be generated, there is
reason to suspect a problem. Parsing a PDB file can thus be used to
detect likely problems. We will give several examples of this in section
[problem structures].
Structure, Model, Chain and Residue are all subclasses of the Entity base class. The Atom class only (partly) implements the Entity interface (because an Atom does not have children).
For each Entity subclass, you can extract a child by using a unique id for that child as a key (e.g. you can extract an Atom object from a Residue object by using an atom name string as a key, you can extract a Chain object from a Model object by using its chain identifier as a key).
Disordered atoms and residues are represented by DisorderedAtom and DisorderedResidue classes, which are both subclasses of the DisorderedEntityWrapper base class. They hide the complexity associated with disorder and behave exactly as Atom and Residue objects.
In general, a child Entity object (i.e. Atom, Residue, Chain, Model) can be extracted from its parent (i.e. Residue, Chain, Model, Structure, respectively) by using an id as a key.
In [16]:
child_entity = parent_entity[child_id]
You can also get a list of all child Entities of a parent Entity object. Note that this list is sorted in a specific way (e.g. according to chain identifier for Chain objects in a Model object).
In [ ]:
child_list = parent_entity.get_list()
You can also get the parent from a child:
In [17]:
parent_entity = child_entity.get_parent()
At all levels of the SMCRA hierarchy, you can also extract a full id. The full id is a tuple containing all id’s starting from the top object (Structure) down to the current object. A full id for a Residue object e.g. is something like:
In [18]:
full_id = residue.get_full_id()
print(full_id)
This corresponds to:
The Structure with id `"1abc`"
The Model with id 0
The Chain with id `"A`"
The Residue with id (`" `", 10, `"A`").
The Residue id indicates that the residue is not a hetero-residue (nor a water) because it has a blank hetero field, that its sequence identifier is 10 and that its insertion code is `"A`".
To get the entity’s id, use the get_id
method:
In [19]:
entity.get_id()
You can check if the entity has a child with a given id by using the
has_id
method:
In [ ]:
entity.has_id(entity_id)
The length of an entity is equal to its number of children:
In [ ]:
nr_children = len(entity)
It is possible to delete, rename, add, etc. child entities from a parent entity, but this does not include any sanity checks (e.g. it is possible to add two residues with the same id to one chain). This really should be done via a nice Decorator class that includes integrity checking, but you can take a look at the code (Entity.py) if you want to use the raw interface.
The Structure object is at the top of the hierarchy. Its id is a user given string. The Structure contains a number of Model children. Most crystal structures (but not all) contain a single model, while NMR structures typically consist of several models. Disorder in crystal structures of large parts of molecules can also result in several models.
The id of the Model object is an integer, which is derived from the
position of the model in the parsed file (they are automatically
numbered starting from 0). Crystal structures generally have only one
model (with id 0), while NMR files usually have several models. Whereas
many PDB parsers assume that there is only one model, the Structure
class in Bio.PDB
is designed such that it can easily handle PDB files
with more than one model.
As an example, to get the first model from a Structure object, use
In [20]:
first_model = structure[0]
The Model object stores a list of Chain children.
The id of a Chain object is derived from the chain identifier in the PDB/mmCIF file, and is a single character (typically a letter). Each Chain in a Model object has a unique id. As an example, to get the Chain object with identifier “A” from a Model object, use
In [21]:
chain_A = model["A"]
The Chain object stores a list of Residue children.
A residue id is a tuple with three elements:
The hetero-field (hetfield): this is
'W'
in the case of a water molecule;
'H_'
followed by the residue name for other hetero
residues (e.g. 'H_GLC'
in the case of a glucose molecule);
blank for standard amino and nucleic acids.
This scheme is adopted for reasons described in section [hetero problems].
The sequence identifier (resseq), an integer describing the position of the residue in the chain (e.g., 100);
The insertion code (icode); a string, e.g. ’A’. The insertion code is sometimes used to preserve a certain desirable residue numbering scheme. A Ser 80 insertion mutant (inserted e.g. between a Thr 80 and an Asn 81 residue) could e.g. have sequence identifiers and insertion codes as follows: Thr 80 A, Ser 80 B, Asn 81. In this way the residue numbering scheme stays in tune with that of the wild type structure.
The id of the above glucose residue would thus be (’H_GLC’, 100, ’A’)
.
If the hetero-flag and insertion code are blank, the sequence identifier
alone can be used:
In [22]:
# Full id
residue = chain[(' ', 100, ' ')]
In [23]:
residue = chain[100]
The reason for the hetero-flag is that many, many PDB files use the same sequence identifier for an amino acid and a hetero-residue or a water, which would create obvious problems if the hetero-flag was not used.
Unsurprisingly, a Residue object stores a set of Atom children. It also contains a string that specifies the residue name (e.g. “ASN”) and the segment identifier of the residue (well known to X-PLOR users, but not used in the construction of the SMCRA data structure).
Let’s look at some examples. Asn 10 with a blank insertion code would have residue id (’ ’, 10, ’ ’). Water 10 would have residue id (’W’, 10, ’ ’). A glucose molecule (a hetero residue with residue name GLC) with sequence identifier 10 would have residue id (’H\_GLC’, 10, ’ ’). In this way, the three residues (with the same insertion code and sequence identifier) can be part of the same chain because their residue id’s are distinct.
In most cases, the hetflag and insertion code fields will be blank, e.g. (’ ’, 10, ’ ’). In these cases, the sequence identifier can be used as a shortcut for the full id:
In [ ]:
# use full id
res10 = chain[(' ', 10, ' ')]
In [ ]:
res10 = chain[10]
Each Residue object in a Chain object should have a unique id. However, disordered residues are dealt with in a special way, as described in section [point mutations].
A Residue object has a number of additional methods:
In [24]:
residue.get_resname() # returns the residue name, e.g. "ASN"
residue.is_disordered() # returns 1 if the residue has disordered atoms
residue.get_segid() # returns the SEGID, e.g. "CHN1"
residue.has_id(name) # test if a residue has a certain atom
You can use is_aa(residue)
to test if a Residue object is an amino
acid.
The Atom object stores the data associated with an atom, and has no children. The id of an atom is its atom name (e.g. “OG” for the side chain oxygen of a Ser residue). An Atom id needs to be unique in a Residue. Again, an exception is made for disordered atoms, as described in section [disordered atoms].
The atom id is simply the atom name (eg. ’CA’
). In practice, the atom
name is created by stripping all spaces from the atom name in the PDB
file.
However, in PDB files, a space can be part of an atom name. Often,
calcium atoms are called ’CA..’
in order to distinguish them from
C$\alpha$ atoms (which are called ’.CA.’
). In cases were stripping the
spaces would create problems (ie. two atoms called ’CA’
in the same
residue) the spaces are kept.
In a PDB file, an atom name consists of 4 chars, typically with leading and trailing spaces. Often these spaces can be removed for ease of use (e.g. an amino acid C$ \alpha $ atom is labeled “.CA.” in a PDB file, where the dots represent spaces). To generate an atom name (and thus an atom id) the spaces are removed, unless this would result in a name collision in a Residue (i.e. two Atom objects with the same atom name and id). In the latter case, the atom name including spaces is tried. This situation can e.g. happen when one residue contains atoms with names “.CA.” and “CA..”, although this is not very likely.
The atomic data stored includes the atom name, the atomic coordinates (including standard deviation if present), the B factor (including anisotropic B factors and standard deviation if present), the altloc specifier and the full atom name including spaces. Less used items like the atom element number or the atomic charge sometimes specified in a PDB file are not stored.
To manipulate the atomic coordinates, use the transform
method of the
Atom
object. Use the set_coord
method to specify the atomic
coordinates directly.
An Atom object has the following additional methods:
In [25]:
a.get_name() # atom name (spaces stripped, e.g. "CA")
a.get_id() # id (equals atom name)
a.get_coord() # atomic coordinates
a.get_vector() # atomic coordinates as Vector object
a.get_bfactor() # isotropic B factor
a.get_occupancy() # occupancy
a.get_altloc() # alternative location specifier
a.get_sigatm() # standard deviation of atomic parameters
a.get_siguij() # standard deviation of anisotropic B factor
a.get_anisou() # anisotropic B factor
a.get_fullname() # atom name (with spaces, e.g. ".CA.")
To represent the atom coordinates, siguij, anisotropic B factor and sigatm Numpy arrays are used.
The get_vector
method returns a Vector
object representation of the
coordinates of the Atom
object, allowing you to do vector operations
on atomic coordinates. Vector
implements the full set of 3D vector
operations, matrix multiplication (left and right) and some advanced
rotation-related operations as well.
As an example of the capabilities of Bio.PDB’s Vector
module, suppose
that you would like to find the position of a Gly residue’s C$\beta$
atom, if it had one. Rotating the N atom of the Gly residue along the
C$\alpha$-C bond over -120 degrees roughly puts it in the position of a
virtual C$\beta$ atom. Here’s how to do it, making use of the rotaxis
method (which can be used to construct a rotation around a certain axis)
of the Vector
module:
In [26]:
# get atom coordinates as vectors
n = residue['N'].get_vector()
c = residue['C'].get_vector()
ca = residue['CA'].get_vector()
In [27]:
n = n - ca
c = c - ca
In [28]:
rot = rotaxis(-pi * 120.0/180.0, c)
In [29]:
cb_at_origin = n.left_multiply(rot)
In [30]:
cb = cb_at_origin + ca
This example shows that it’s possible to do some quite nontrivial vector
operations on atomic data, which can be quite useful. In addition to all
the usual vector operations (cross (use **
), and dot (use *
)
product, angle, norm, etc.) and the above mentioned rotaxis
function,
the Vector
module also has methods to rotate (rotmat
) or reflect
(refmat
) one vector on top of another.
Atom/Residue/Chain/Model
from a StructureThese are some examples:
In [31]:
model = structure[0]
chain = model['A']
residue = chain[100]
atom = residue['CA']
Note that you can use a shortcut:
In [32]:
atom = structure[0]['A'][100]['CA']
Bio.PDB can handle both disordered atoms and point mutations (i.e. a Gly and an Ala residue in the same position).
Disorder should be dealt with from two points of view: the atom and the residue points of view. In general, we have tried to encapsulate all the complexity that arises from disorder. If you just want to loop over all C$\alpha$ atoms, you do not care that some residues have a disordered side chain. On the other hand it should also be possible to represent disorder completely in the data structure. Therefore, disordered atoms or residues are stored in special objects that behave as if there is no disorder. This is done by only representing a subset of the disordered atoms or residues. Which subset is picked (e.g. which of the two disordered OG side chain atom positions of a Ser residue is used) can be specified by the user.
Disordered atoms are represented by ordinary Atom
objects, but all
Atom
objects that represent the same physical atom are stored in a
DisorderedAtom
object (see Fig. [fig:smcra]). Each Atom
object in
a DisorderedAtom
object can be uniquely indexed using its altloc
specifier. The DisorderedAtom
object forwards all uncaught method
calls to the selected Atom object, by default the one that represents
the atom with the highest occupancy. The user can of course change the
selected Atom
object, making use of its altloc specifier. In this way
atom disorder is represented correctly without much additional
complexity. In other words, if you are not interested in atom disorder,
you will not be bothered by it.
Each disordered atom has a characteristic altloc identifier. You can
specify that a DisorderedAtom
object should behave like the Atom
object associated with a specific altloc identifier:
In [33]:
atom.disordered_select('A') # select altloc A atom
print(atom.get_altloc())
In [34]:
atom.disordered_select('B') # select altloc B atom
print(atom.get_altloc())
The most common case is a residue that contains one or more disordered atoms. This is evidently solved by using DisorderedAtom objects to represent the disordered atoms, and storing the DisorderedAtom object in a Residue object just like ordinary Atom objects. The DisorderedAtom will behave exactly like an ordinary atom (in fact the atom with the highest occupancy) by forwarding all uncaught method calls to one of the Atom objects (the selected Atom object) it contains.
A special case arises when disorder is due to a point mutation, i.e. when two or more point mutants of a polypeptide are present in the crystal. An example of this can be found in PDB structure 1EN2.
Since these residues belong to a different residue type (e.g. let’s say
Ser 60 and Cys 60) they should not be stored in a single Residue
object as in the common case. In this case, each residue is represented
by one Residue
object, and both Residue
objects are stored in a
single DisorderedResidue
object (see Fig. [fig:smcra]).
The DisorderedResidue
object forwards all uncaught methods to the
selected Residue
object (by default the last Residue
object added),
and thus behaves like an ordinary residue. Each Residue
object in a
DisorderedResidue
object can be uniquely identified by its residue
name. In the above example, residue Ser 60 would have id “SER” in the
DisorderedResidue
object, while residue Cys 60 would have id “CYS”.
The user can select the active Residue
object in a DisorderedResidue
object via this id.
Example: suppose that a chain has a point mutation at position 10, consisting of a Ser and a Cys residue. Make sure that residue 10 of this chain behaves as the Cys residue.
In [35]:
residue = chain[10]
residue.disordered_select('CYS')
In addition, you can get a list of all Atom
objects (ie. all
DisorderedAtom
objects are ’unpacked’ to their individual Atom
objects) using the get_unpacked_list
method of a (Disordered)Residue
object.
A common problem with hetero residues is that several hetero and non-hetero residues present in the same chain share the same sequence identifier (and insertion code). Therefore, to generate a unique id for each hetero residue, waters and other hetero residues are treated in a different way.
Remember that Residue object have the tuple (hetfield, resseq, icode) as id. The hetfield is blank (“ ”) for amino and nucleic acids, and a string for waters and other hetero residues. The content of the hetfield is explained below.
The hetfield string of a water residue consists of the letter “W”. So a typical residue id for a water is (“W”, 1, “ ”).
The hetfield string for other hetero residues starts with “H_” followed by the residue name. A glucose molecule e.g. with residue name “GLC” would have hetfield “H_GLC”. Its residue id could e.g. be (“H_GLC”, 1, “ ”).
In [39]:
from Bio.PDB.PDBParser import PDBParser
parser = PDBParser()
structure = parser.get_structure("test", "data/pdb1fat.ent")
model = structure[0]
chain = model["A"]
residue = chain[1]
atom = residue["CA"]
In [40]:
p = PDBParser()
structure = p.get_structure('X', 'data/pdb1fat.ent')
for model in structure:
for chain in model:
for residue in chain:
for atom in residue:
print(atom)
There is a shortcut if you want to iterate over all atoms in a structure:
In [41]:
atoms = structure.get_atoms()
for atom in atoms:
print(atom)
Similarly, to iterate over all atoms in a chain, use
In [42]:
atoms = chain.get_atoms()
for atom in atoms:
print(atom)
In [43]:
residues = model.get_residues()
for residue in residues:
print(residue)
You can also use the Selection.unfold_entities
function to get all
residues from a structure:
In [45]:
from Bio.PDB import Selection
res_list = Selection.unfold_entities(structure, 'R')
or to get all atoms from a chain:
In [46]:
atom_list = Selection.unfold_entities(chain, 'A')
Obviously, A=atom, R=residue, C=chain, M=model, S=structure
. You can
use this to go up in the hierarchy, e.g. to get a list of (unique)
Residue
or Chain
parents from a list of Atoms
:
In [47]:
residue_list = Selection.unfold_entities(atom_list, 'R')
chain_list = Selection.unfold_entities(atom_list, 'C')
In [50]:
residue_id = ("H_GLC", 10, " ")
residue = chain[residue_id]
In [51]:
for residue in chain.get_list():
residue_id = residue.get_id()
hetfield = residue_id[0]
if hetfield[0]=="H":
print(residue_id)
In [52]:
for model in structure.get_list():
for chain in model.get_list():
for residue in chain.get_list():
if residue.has_id("CA"):
ca = residue["CA"]
if ca.get_bfactor() > 50.0:
print(ca.get_coord())
In [53]:
for model in structure.get_list():
for chain in model.get_list():
for residue in chain.get_list():
if residue.is_disordered():
resseq = residue.get_id()[1]
resname = residue.get_resname()
model_id = model.get_id()
chain_id = chain.get_id()
print(model_id, chain_id, resname, resseq)
In [54]:
for model in structure.get_list():
for chain in model.get_list():
for residue in chain.get_list():
if residue.is_disordered():
for atom in residue.get_list():
if atom.is_disordered() and atom.disordered_has_id("A"):
atom.disordered_select("A")
Structure
object[subsubsec:extracting_polypeptides] {#extracting-polypeptides-from-a-structure-objectsubsubsecextracting_polypeptides .unnumbered}To extract polypeptides from a structure, construct a list of
Polypeptide
objects from a Structure
object using
PolypeptideBuilder
as follows:
In [55]:
model_nr = 1
polypeptide_list = build_peptides(structure, model_nr)
for polypeptide in polypeptide_list:
print(polypeptide)
A Polypeptide object is simply a UserList of Residue objects, and is
always created from a single Model (in this case model 1). You can use
the resulting Polypeptide
object to get the sequence as a Seq
object
or to get a list of C$\alpha$ atoms as well. Polypeptides can be built
using a C-N or a C$\alpha$-C$\alpha$ distance criterion.
Example:
In [ ]:
# Using C-N
ppb = PPBuilder()
for pp in ppb.build_peptides(structure):
print(pp.get_sequence())
In [ ]:
ppb = CaPPBuilder()
for pp in ppb.build_peptides(structure):
print(pp.get_sequence())
Note that in the above case only model 0 of the structure is considered
by PolypeptideBuilder
. However, it is possible to use
PolypeptideBuilder
to build Polypeptide
objects from Model
and
Chain
objects as well.
The first thing to do is to extract all polypeptides from the structure
(as above). The sequence of each polypeptide can then easily be obtained
from the Polypeptide
objects. The sequence is represented as a
Biopython Seq
object, and its alphabet is defined by a
ProteinAlphabet
object.
Example:
In [56]:
seq = polypeptide.get_sequence()
print(seq)
In [57]:
# Get some atoms
ca1 = residue1['CA']
ca2 = residue2['CA']
In [ ]:
distance = ca1-ca2
In [58]:
vector1 = atom1.get_vector()
vector2 = atom2.get_vector()
vector3 = atom3.get_vector()
angle = calc_angle(vector1, vector2, vector3)
In [59]:
vector1 = atom1.get_vector()
vector2 = atom2.get_vector()
vector3 = atom3.get_vector()
vector4 = atom4.get_vector()
angle = calc_dihedral(vector1, vector2, vector3, vector4)
Use NeighborSearch
to perform neighbor lookup. The neighbor lookup is
done using a KD tree module written in C (see Bio.KDTree
), making it
very fast. It also includes a fast method to find all point pairs within
a certain distance of each other.
Use a Superimposer
object to superimpose two coordinate sets. This
object calculates the rotation and translation matrix that rotates two
lists of atoms on top of each other in such a way that their RMSD is
minimized. Of course, the two lists need to contain the same number of
atoms. The Superimposer
object can also apply the rotation/translation
to a list of atoms. The rotation and translation are stored as a tuple
in the rotran
attribute of the Superimposer
object (note that the
rotation is right multiplying!). The RMSD is stored in the rmsd
attribute.
The algorithm used by Superimposer
comes from @golub1989 [Golub & Van
Loan] and makes use of singular value decomposition (this is implemented
in the general Bio.SVDSuperimposer
module).
Example:
In [61]:
from Bio.PDB import Superimposer
sup = Superimposer()
In [62]:
sup.set_atoms(fixed, moving)
In [63]:
print(sup.rotran)
print(sup.rms)
In [64]:
sup.apply(moving)
To superimpose two structures based on their active sites, use the active site atoms to calculate the rotation/translation matrices (as above), and apply these to the whole molecule.
First, create an alignment file in FASTA format, then use the
StructureAlignment
class. This class can also be used for alignments
with more than two structures.
Half Sphere Exposure (HSE) is a new, 2D measure of solvent exposure @hamelryck2005. Basically, it counts the number of C$\alpha$ atoms around a residue in the direction of its side chain, and in the opposite direction (within a radius of $13 \AA$). Despite its simplicity, it outperforms many other measures of solvent exposure.
HSE comes in two flavors: HSE$\alpha$ and HSE$\beta$. The former only
uses the C$\alpha$ atom positions, while the latter uses the C$\alpha$
and C$\beta$ atom positions. The HSE measure is calculated by the
HSExposure
class, which can also calculate the contact number. The
latter class has methods which return dictionaries that map a Residue
object to its corresponding HSE$\alpha$, HSE$\beta$ and contact number
values.
Example:
In [66]:
from Bio.PDB import HSExposure
model = structure[0]
hse = HSExposure()
In [67]:
exp_ca = hse.calc_hs_exposure(model, option='CA3')
In [ ]:
exp_cb=hse.calc_hs_exposure(model, option='CB')
In [ ]:
exp_fs = hse.calc_fs_exposure(model)
In [ ]:
print(exp_ca[some_residue])
For this functionality, you need to install DSSP (and obtain a license
for it — free for academic use, see http://www.cmbi.kun.nl/gv/dssp/).
Then use the DSSP
class, which maps Residue
objects to their
secondary structure (and accessible surface area). The DSSP codes are
listed in Table [cap:DSSP-codes]. Note that DSSP (the program, and
thus by consequence the class) cannot handle multiple models!
Code Secondary structure
H $\alpha$-helix
B Isolated $\beta$-bridge residue
E Strand
G 3-10 helix
I $\Pi$-helix
T Turn
S Bend
- Other
: [cap:DSSP-codes]DSSP codes in Bio.PDB.
The DSSP
class can also be used to calculate the accessible surface
area of a residue. But see also section [subsec:residue_depth].
Residue depth is the average distance of a residue’s atoms from the
solvent accessible surface. It’s a fairly new and very powerful
parameterization of solvent accessibility. For this functionality, you
need to install Michel Sanner’s MSMS program
(http://www.scripps.edu/pub/olson-web/people/sanner/html/msms_home.html).
Then use the ResidueDepth
class. This class behaves as a dictionary
which maps Residue
objects to corresponding (residue depth, C$\alpha$
depth) tuples. The C$\alpha$ depth is the distance of a residue’s
C$\alpha$ atom to the solvent accessible surface.
Example:
In [69]:
from Bio.PDB import ResidueDepth
model = structure[0]
rd = ResidueDepth(model, pdb_file)
residue_depth, ca_depth=rd[some_residue]
You can also get access to the molecular surface itself (via the
get_surface
function), in the form of a Numeric Python array with the
surface points.
It is well known that many PDB files contain semantic errors (not the structures themselves, but their representation in PDB files). Bio.PDB tries to handle this in two ways. The PDBParser object can behave in two ways: a restrictive way and a permissive way, which is the default.
Example:
In [70]:
# Permissive parser
parser = PDBParser(PERMISSIVE=1)
parser = PDBParser() # The same (default)
In [71]:
strict_parser = PDBParser(PERMISSIVE=0)
In the permissive state (DEFAULT), PDB files that obviously contain errors are “corrected” (i.e. some residues or atoms are left out). These errors include:
Multiple residues with the same identifier
Multiple atoms with the same identifier (taking into account the altloc identifier)
These errors indicate real problems in the PDB file (for details see @hamelryck2003a [Hamelryck and Manderick, 2003]). In the restrictive state, PDB files with errors cause an exception to occur. This is useful to find errors in PDB files.
Some errors however are automatically corrected. Normally each disordered atom should have a non-blank altloc identifier. However, there are many structures that do not follow this convention, and have a blank and a non-blank identifier for two disordered positions of the same atom. This is automatically interpreted in the right way.
Sometimes a structure contains a list of residues belonging to chain A, followed by residues belonging to chain B, and again followed by residues belonging to chain A, i.e. the chains are ’broken’. This is also correctly interpreted.
The PDBParser/Structure class was tested on about 800 structures (each belonging to a unique SCOP superfamily). This takes about 20 minutes, or on average 1.5 seconds per structure. Parsing the structure of the large ribosomal subunit (1FKK), which contains about 64000 atoms, takes 10 seconds on a 1000 MHz PC.
Three exceptions were generated in cases where an unambiguous data structure could not be built. In all three cases, the likely cause is an error in the PDB file that should be corrected. Generating an exception in these cases is much better than running the chance of incorrectly describing the structure in a data structure.
One structure contains two amino acid residues in one chain with the same sequence identifier (resseq 3) and icode. Upon inspection it was found that this chain contains the residues Thr A3, …, Gly A202, Leu A3, Glu A204. Clearly, Leu A3 should be Leu A203. A couple of similar situations exist for structure 1FFK (which e.g. contains Gly B64, Met B65, Glu B65, Thr B67, i.e. residue Glu B65 should be Glu B66).
Structure 1EJG contains a Ser/Pro point mutation in chain A at position
Some errors are quite common and can be easily corrected without much risk of making a wrong interpretation. These cases are listed below.
Normally each disordered atom should have a non-blank altloc identifier. However, there are many structures that do not follow this convention, and have a blank and a non-blank identifier for two disordered positions of the same atom. This is automatically interpreted in the right way.
Sometimes a structure contains a list of residues belonging to chain A, followed by residues belonging to chain B, and again followed by residues belonging to chain A, i.e. the chains are “broken”. This is correctly interpreted.
Sometimes a PDB file cannot be unambiguously interpreted. Rather than guessing and risking a mistake, an exception is generated, and the user is expected to correct the PDB file. These cases are listed below.
All residues in a chain should have a unique id. This id is generated based on:
The sequence identifier (resseq).
The insertion code (icode).
The hetfield string (“W” for waters and “H_” followed by the residue name for other hetero residues)
The residue names of the residues in the case of point mutations (to store the Residue objects in a DisorderedResidue object).
If this does not lead to a unique id something is quite likely wrong, and an exception is generated.
All atoms in a residue should have a unique id. This id is generated based on:
The atom name (without spaces, or with spaces if a problem arises).
The altloc specifier.
If this does not lead to a unique id something is quite likely wrong, and an exception is generated.
Structures can be downloaded from the PDB (Protein Data Bank) by using
the retrieve_pdb_file
method on a PDBList
object. The argument for
this method is the PDB identifier of the structure.
In [73]:
from Bio.PDB import PDBList
pdbl = PDBList()
pdbl.retrieve_pdb_file('1FAT')
Out[73]:
The PDBList
class can also be used as a command-line tool:
python PDBList.py 1fat
The downloaded file will be called pdb1fat.ent
and stored in the
current working directory. Note that the retrieve_pdb_file
method also
has an optional argument pdir
that specifies a specific directory in
which to store the downloaded PDB files.
The retrieve_pdb_file
method also has some options to specify the
compression format used for the download, and the program used for local
decompression (default .Z
format and gunzip
). In addition, the PDB
ftp site can be specified upon creation of the PDBList
object. By
default, the server of the Worldwide Protein Data Bank
(ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/) is used.
See the API documentation for more details. Thanks again to Kristian
Rother for donating this module.
The following commands will store all PDB files in the /data/pdb
directory:
python PDBList.py all /data/pdb
python PDBList.py all /data/pdb -d
The API method for this is called download_entire_pdb
. Adding the -d
option will store all files in the same directory. Otherwise, they are
sorted into PDB-style subdirectories according to their PDB ID’s.
Depending on the traffic, a complete download will take 2-4 days.
This can also be done using the PDBList
object. One simply creates a
PDBList
object (specifying the directory where the local copy of the
PDB is present) and calls the update_pdb
method:
In [75]:
pl = PDBList(pdb='/tmp/data/pdb')
pl.update_pdb()
In [ ]: